Downstream Analysis of Synteny Networks with MSA2dist

Kristian K Ullrich

Max Planck Institute for Evolutionary Biology

Background - DNA Evolution

A matter of distance.

Background - Coding Sequences

Synonymous and NonSynonymous Substitutions: nucleotide substitutions might lead to changes on amino acid level.

Typically used to:

  1. Detect and Date whole-genome duplication (WGD) events;
  2. Predict selective pressure acting on a protein;
  3. Predict selective pressure acting on a codon;
  4. Investigate codon usage (R/Bioconductor package coRdon).

MSA2dist

An R/Bioconductor package to calculate pairwise distances between all sequences of a MultipleSequenceAlignment (DNAStringSet or AAStringSet) and to conduct codon based analysis.

MSA2dist Features:

  1. Calculates pairwise AA, DNA and IUPAC distances;
  2. Calculates Synonymous and NonSynonymous Substitutions (dN/dS) KaKs_Calcultor 2.0 models (C++ ported to R with Rcpp);
  3. Use pre-calculated MSA or conduct all possible pairwise alignments prior dN/dS calculation;
  4. Calculate average behavior of each codon from MSA.

MSA2dist workflow

Installation

From Bioconductor:

BiocManager::install(version='devel')
BiocManager::install("MSA2dist")


From GitHub:

remotes::install_github("kullrich/MSA2dist")

Example data set

Data: Whole-genome coding sequences and gene ranges for Chlorophyta algae + Physcomitrium patens as an outgroup.

Source: Pico-PLAZA 3.0

Where to find: the data/ directory in the GitHub repo associated with this presentation.

data
├── clusters.rda
├── codingsequences.rda
└── data_acquisition_cds.md
  1. codingsequences.rda: DNAStringSet object.
  2. clusters.rda: Pre-calculated syntenet clusters.

Example data set: codingsequences

# Load necessary libraries
library(MSA2dist)
library(Biostrings)
library(tidyr)
library(dplyr)
library(stringr)

# Load data set codingsequences
load(here::here("data", "codingsequences.rda"))

# Inspect data
head(names(codingsequences))
[1] "AP00G00010" "AP00G00020" "AP00G00030" "AP00G00040" "AP00G00050"
[6] "AP00G00060"

Example data set: clusters

# Load data set clusters
load(here::here("data", "clusters.rda"))

# Inspect data
head(clusters)
                   Gene Cluster
1 Chlo_CNC64A_028G00030       1
2 Chlo_CNC64A_028G00040       2
3 Chlo_CNC64A_028G00070       3
4 Chlo_CNC64A_028G00080       4
5 Chlo_CNC64A_028G00110       5
6 Chlo_CNC64A_028G00140       6
length(table(clusters$Cluster))
[1] 22605

Fetch coding sequences for a specific cluster

# Get size distribution of clusters
head(table(clusters$Cluster))

 1  2  3  4  5  6 
38 38 37  2  9  4 
# Get first cluster of size 10 (fc10)
fc10 <- which(table(clusters$Cluster)==10)[1]

my_cluster <- clusters %>%
    dplyr::filter(Cluster==fc10)

head(my_cluster)
                      Gene Cluster
1          Oluc_OL13G02120     119
2 Bpra_BPRRCC1105_04G02530     119
3     Osp_ORCC809_13G00870     119
4          Oluc_OL21G00860     119
5         Omed_OM_08G02190     119
6     Msp_MRCC299_06G03520     119

Fetch coding sequences for a specific cluster

# Remove species unique identifier and add as GeneID
my_cluster$GeneID <- stringr::str_split_fixed(my_cluster$Gene, "_", 2)[, 2]

head(my_cluster)
                      Gene Cluster              GeneID
1          Oluc_OL13G02120     119          OL13G02120
2 Bpra_BPRRCC1105_04G02530     119 BPRRCC1105_04G02530
3     Osp_ORCC809_13G00870     119    ORCC809_13G00870
4          Oluc_OL21G00860     119          OL21G00860
5         Omed_OM_08G02190     119         OM_08G02190
6     Msp_MRCC299_06G03520     119    MRCC299_06G03520
# Fetch coding sequences
my_cds <- codingsequences[match(my_cluster$GeneID, 
    names(codingsequences))]

head(my_cds)
DNAStringSet object of length 6:
    width seq                                               names               
[1]  1605 ATGTCGCTCAAGTCCATCAACCC...GCGTGAACATGCGCAAGCGATGA OL13G02120
[2]  1623 ATGTCCTCCTCTCTCCGTTCGGT...GCGGTCGCGGACCGGCTGAATAA BPRRCC1105_04G02530
[3]  1605 ATGTCGCTCAAATCGATCAACCC...GCATCAACATGCGAAAGAAATAG ORCC809_13G00870
[4]  1605 ATGTCGCTCAAGTCCATCAACCC...GCGTGAACATGCGCAAGCGATGA OL21G00860
[5]  1605 ATGAGTTTGAAATCCGTCAACCC...GGGTGAACATGCGCAAACGATAA OM_08G02190
[6]  1707 ATGCAGGGCATGCAAAGCCCCAT...CTCCCGAAAAACTGGGGGAGTAA MRCC299_06G03520

Coding sequence alignment

To calculate a coding sequence alignment for two sequences:

# Get coding sequence alignment
cds1_cds2_aln <- cds2codonaln(cds1 = my_cds[1], cds2 = my_cds[2], remove.gaps = FALSE)
cds1_cds2_aln
DNAStringSet object of length 2:
    width seq                                               names               
[1]  1623 ATGTCG------CTCAAGTCCAT...GCAAGCGA------------TGA OL13G02120
[2]  1623 ATGTCCTCCTCTCTCCGTTCGGT...GCGGTCGCGGACCGGCTGAATAA BPRRCC1105_04G02530

To calculate dN/dS from this alignment:

# Calculate dN/dS for this alignment; model = Li
cds1_cds2_kaks <- dnastring2kaks(cds1_cds2_aln,
    model = "Li", threads = 1, isMSA = TRUE)
cds1_cds2_kaks
  Comp1 Comp2       seq1                seq2        ka       ks          vka
1     1     2 OL13G02120 BPRRCC1105_04G02530 0.1640926 1.849032 0.0003232055
        vks
1 0.4047548

Other Biostrings::pairwiseAlignment options can be set:

type = "global", substitutionMatrix = "BLOSUM62"
gapOpening = 10, gapExtension = 0.5

Parallelized pairwise coding sequence alignments

To calculate all pairwise combinations:

start_time <- Sys.time()
# Calculate dN/dS for the whole cluster; model = YN
my_cds_kaks <- dnastring2kaks(my_cds,
    model = "YN", threads = 2, isMSA = FALSE)
end_time <- Sys.time()

head(my_cds_kaks)
         Comp1 Comp2       seq1                seq2 Method        Ka      Ks
result.1     1     2 OL13G02120 BPRRCC1105_04G02530     YN  0.145529 4.33579
result.2     1     3 OL13G02120    ORCC809_13G00870     YN 0.0328853 2.73769
result.3     1     4 OL13G02120          OL21G00860     YN        NA      NA
result.4     1     5 OL13G02120         OM_08G02190     YN 0.0570134 4.34114
result.5     1     6 OL13G02120    MRCC299_06G03520     YN  0.672847 4.12155
result.6     1     7 OL13G02120          MP09G03570     YN  0.665727 4.08101
             Ka/Ks P-Value(Fisher) Length S-Sites N-Sites Fold-Sites(0:2:4)
result.1 0.0335645    3.77014e-136   1602 324.101  1277.9                NA
result.2 0.0120121    3.37981e-125   1602 313.774 1288.23                NA
result.3        NA              NA   1602 313.942 1288.06                NA
result.4 0.0131333    1.79968e-161   1602 326.422 1275.58                NA
result.5  0.163251      9.8868e-64   1560 243.569 1316.43                NA
result.6  0.163128     5.33907e-66   1560 230.753 1329.25                NA
         Substitutions S-Substitutions N-Substitutions
result.1           443         274.184         168.816
result.2           238          196.59         41.4101
result.3             0              NA              NA
result.4           326         255.985         70.0148
result.5           819         236.086         582.914
result.6           811         226.544         584.456
         Fold-S-Substitutions(0:2:4) Fold-N-Substitutions(0:2:4)
result.1                          NA                          NA
result.2                          NA                          NA
result.3                          NA                          NA
result.4                          NA                          NA
result.5                          NA                          NA
result.6                          NA                          NA
         Divergence-Time Substitution-Rate-Ratio(rTC:rAG:rTA:rCG:rTG:rCA/rCA)
result.1        0.993263                              1.65453:1.65453:1:1:1:1
result.2        0.562657                              2.66376:2.66376:1:1:1:1
result.3              NA                                          2:2:1:1:1:1
result.4        0.929943                              1.39012:1.39012:1:1:1:1
result.5         1.21131                              1.36501:1.36501:1:1:1:1
result.6         1.17091                              1.09442:1.09442:1:1:1:1
                                    GC(1:2:3) ML-Score AICc Akaike-Weight Model
result.1 0.537893(0.536969:0.358595:0.718115)       NA   NA            NA    NA
result.2 0.561682(0.562617:0.349533:0.772897)       NA   NA            NA    NA
result.3  0.561371(0.568224:0.35514:0.760748)       NA   NA            NA    NA
result.4   0.542368(0.561682:0.35514:0.71028)       NA   NA            NA    NA
result.5 0.568325(0.560034:0.359348:0.785592)       NA   NA            NA    NA
result.6 0.568442(0.544674:0.359107:0.801546)       NA   NA            NA    NA

How long did it take?

end_time - start_time
Time difference of 7.949049 secs

Transform Ka values into a distance matrix

ka_mat <- my_cds_kaks %>%
    dplyr::select(seq1, seq2, Ka) %>%
    dplyr::mutate(Ka = as.numeric(Ka)) %>%
    tidyr::pivot_wider(names_from = "seq1", values_from = Ka)

head(ka_mat)
# A tibble: 6 × 10
  seq2   OL13G…¹ BPRRC…² ORCC8…³ OL21G…⁴ OM_08…⁵ MRCC2…⁶ MP09G…⁷ MP09G…⁸ OT_13…⁹
  <chr>    <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
1 BPRRC…  0.146   NA     NA      NA       NA     NA           NA      NA      NA
2 ORCC8…  0.0329   0.151 NA      NA       NA     NA           NA      NA      NA
3 OL21G… NA        0.146  0.0329 NA       NA     NA           NA      NA      NA
4 OM_08…  0.0570   0.146  0.0719  0.0570  NA     NA           NA      NA      NA
5 MRCC2…  0.673    0.699  0.653   0.673    0.715 NA           NA      NA      NA
6 MP09G…  0.666    0.696  0.646   0.666    0.678  0.0486      NA      NA      NA
# … with abbreviated variable names ¹​OL13G02120, ²​BPRRCC1105_04G02530,
#   ³​ORCC809_13G00870, ⁴​OL21G00860, ⁵​OM_08G02190, ⁶​MRCC299_06G03520,
#   ⁷​MP09G03570, ⁸​MP09G02740, ⁹​OT_13G02270

Transform Ka values into a distance matrix

# Extract sequence names
seq_names <- c(colnames(ka_mat)[-1], rev(ka_mat$seq2)[1])

# Add NA to fill into square distance matrix
ka_mat <- NA |> rbind(ka_mat[, -1]) |> cbind(NA) |>
    tidyr::as_tibble()

# Set column names
colnames(ka_mat) <- seq_names

head(ka_mat)
# A tibble: 6 × 10
  OL13G02120 BPRRCC110…¹ ORCC8…² OL21G…³ OM_08…⁴ MRCC2…⁵ MP09G…⁶ MP09G…⁷ OT_13…⁸
       <dbl>       <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
1    NA           NA     NA      NA       NA          NA      NA      NA      NA
2     0.146       NA     NA      NA       NA          NA      NA      NA      NA
3     0.0329       0.151 NA      NA       NA          NA      NA      NA      NA
4    NA            0.146  0.0329 NA       NA          NA      NA      NA      NA
5     0.0570       0.146  0.0719  0.0570  NA          NA      NA      NA      NA
6     0.673        0.699  0.653   0.673    0.715      NA      NA      NA      NA
# … with 1 more variable: MRCC299_06G02640 <lgl>, and abbreviated variable
#   names ¹​BPRRCC1105_04G02530, ²​ORCC809_13G00870, ³​OL21G00860, ⁴​OM_08G02190,
#   ⁵​MRCC299_06G03520, ⁶​MP09G03570, ⁷​MP09G02740, ⁸​OT_13G02270

Use distance matrix to plot bionjs tree

ape::bionjs reconstructs a phylogenetic tree from a distance matrix with possibly missing values:

# Calculate and plot bionjs tree
(ka_mat |> as.dist() |> ape::bionjs()) |>
    plot(cex=1)

Calculate average behavior of each codon

Here, a pre-calculated MSA is necessary:

# load example data
data(hiv, package="MSA2dist")

# calculate average behavior
hiv.xy <- hiv |> dnastring2codonmat() |> codonmat2xy()

Plot average behavior of each codon:

Here’s where you can find me: